Introduction to Open Data Science - Course Project

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Tue Dec  5 08:21:24 2023"

Chapter 1

Getting started

Link to GitHub repository: https://github.com/quadrangle1/IODS-project/

Link to course diary page: https://quadrangle1.github.io/IODS-project/

Exercise set 1 contains a lot of useful references for revising basic commands when needed. Most of its content was already at least somewhat familiar to me before this course.


Assignment 2 - Regression and model validation

This week I have used a linear regression model to study the relationship between several explanatory variables and an outcome variable. I have learned the workflow of fitting the model and assessing its quality.

date()
## [1] "Tue Dec  5 08:21:24 2023"

The workflow begins by reading in the data prepared in the data wrangling part of the assignment. In essence, the studied dataset combines participant metadata with data from a learning style questionnaire. The dataset consists of 166 rows and 7 variables (gender, age, attitude, exam points, and three combined learning style variables).

# Load needed packages
suppressMessages(library(tidyverse))

# Read in the data
lrn14 <- read.csv("/Users/hchaltia/Open data science course/IODS-project/data/lrn14_subset_for_analysis.csv")

# Explore the data
dim(lrn14)
str(lrn14)

Next we will explore the data graphically and look at the summaries of the variables.

# Load needed packages
suppressMessages(library(ggplot2))
suppressMessages(library(GGally))

# Create a scatterplot matrix (exclude variable "gender")
pairs(lrn14[-2])

# Create a plot matrix to investigate the relationships of the variables
p <- ggpairs(lrn14, mapping = aes(col = gender), lower = list(combo = wrap("facethist", bins = 20)))
p

By eyeballing the scatterplot matrix, it looks like attitude and exam points are positively correlated. Age does not correlate strongly with any of the other variables.

Checking the more detailed plot matrix next, we can see that the correlation between attitude and exam points, 0.437, is not extremely strong, but it is still the highest of all the pairwise correlations. The distributions of the variables seem fairly similar between the two studied genders, except for attitude and the surface learning compound variable, where the peaks of the distributions differ slightly.

Overall, the study included almost twice as many women as men.

# Check summaries of the different variables
lrn14 %>% summary()
##       Age           gender             Attitude         Points     
##  Min.   :17.00   Length:166         Min.   :14.00   Min.   : 7.00  
##  1st Qu.:21.00   Class :character   1st Qu.:26.00   1st Qu.:19.00  
##  Median :22.00   Mode  :character   Median :32.00   Median :23.00  
##  Mean   :25.51                      Mean   :31.43   Mean   :22.72  
##  3rd Qu.:27.00                      3rd Qu.:37.00   3rd Qu.:27.75  
##  Max.   :55.00                      Max.   :50.00   Max.   :33.00  
##       surf            stra            deep      
##  Min.   :1.583   Min.   :1.250   Min.   :1.583  
##  1st Qu.:2.417   1st Qu.:2.625   1st Qu.:3.333  
##  Median :2.833   Median :3.188   Median :3.667  
##  Mean   :2.787   Mean   :3.121   Mean   :3.680  
##  3rd Qu.:3.167   3rd Qu.:3.625   3rd Qu.:4.083  
##  Max.   :4.333   Max.   :5.000   Max.   :4.917

From the summaries we can gather that the mean age of the participants is 25.51 years, with a range of 17 to 55 years. Looking at the compound learning variables, deep learning has the highest mean (about 3.68), while surface learning has the lowest (about 2.79).

Next, we will fit a linear regression model with three explanatory variables and exam points as the outcome variable. Let’s choose attitude, surface learning and strategic learning as the explanatory variables, because they show the highest correlations with the outcome variable.

# Fit the model
fit1 <- lm(Points ~ Attitude + surf + stra, data = lrn14)

# Show the summary of the model
summary(fit1)
## 
## Call:
## lm(formula = Points ~ Attitude + surf + stra, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## Attitude     0.33952    0.05741   5.913 1.93e-08 ***
## surf        -0.58607    0.80138  -0.731  0.46563    
## stra         0.85313    0.54159   1.575  0.11716    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

From the summary we see that out of these three covariates, only attitude is a significant predictor of exam points. Overall the model is not very good, explaining only about 19 percent of the observed variation in the outcome variable (adjusted R-squared). The significance of a coefficient is based on its t value, which measures how many standard errors the estimate is from zero. The F test checks whether at least one of the coefficients differs significantly from zero.
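As a sanity check, the t statistic and its two-sided p-value for attitude can be recomputed by hand from the numbers reported in the summary above:

```r
# Recompute the t statistic and p-value for Attitude from the summary above
est <- 0.33952   # coefficient estimate
se  <- 0.05741   # standard error
df  <- 162       # residual degrees of freedom
t_val <- est / se                      # about 5.91, as reported
p_val <- 2 * pt(-abs(t_val), df = df)  # two-sided p-value, about 1.9e-08
```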

According to the assignment instructions, let’s fit the model using the significant explanatory variable.

# Fit the model
fit2 <- lm(Points ~ Attitude, data = lrn14)

# Show the summary of the model
summary(fit2)
## 
## Call:
## lm(formula = Points ~ Attitude, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

With only attitude as the covariate, the p-value of its t statistic is 4.12e-09, which is clearly significant in this context. The adjusted R-squared shows that the model still explains about 19 percent of the observed variation in the outcome variable. In essence, this model is practically as good as the previous one.
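The adjusted R-squared can likewise be verified by hand from the reported R-squared, using n = 166 observations and p = 1 predictor:

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
r2 <- 0.1906
n  <- 166
p  <- 1
1 - (1 - r2) * (n - 1) / (n - p - 1)  # about 0.1857, matching the summary up to rounding
```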

Next we will produce diagnostic plots to assess model quality graphically.

# Draw plots in 2x2 format
par(mfrow = c(2,2))
plot(fit2, which = c(1,2,5))

The linear model assumes a linear relationship between the studied variables, normally distributed errors, independent observations, and homoscedasticity (constant variance) of the errors.

The residuals vs. fitted plot can reveal non-linear patterns in the data: if the red line in the plot is roughly horizontal, linearity can be assumed.

The Q-Q plot examines the distribution of the residuals: if the points fall on the diagonal line, normality can be assumed.

The residuals vs. leverage plot shows whether there are overly influential observations in the dataset. Data points falling outside the dashed lines labeled “Cook’s distance” are considered overly influential.

According to these plots, the model assumptions seem to be adequately met.
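The graphical checks can be complemented with numeric ones. This is a minimal sketch on simulated data, not the actual course model; one would swap in fit2 to check it:

```r
# Numeric counterparts to the diagnostic plots, shown on simulated data
set.seed(1)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)
shapiro.test(resid(fit))  # tests normality of the residuals
max(cooks.distance(fit))  # influence; values near or above 1 warrant a closer look
```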


Assignment 3 - Logistic regression

The data for this assignment was collected from two Portuguese schools and concerns student performance in both Portuguese language and mathematics. In addition to school performance, the dataset contains information about the students themselves (age, sex, alcohol use) and about related socioeconomic factors (their parents’ education). The data contains 370 observations and 35 variables.

Let’s start by reading in the data.

suppressMessages(library(readr))
alc <- read_csv("/Users/hchaltia/Open data science course/IODS-project/data/alc_dataset.csv")
## Rows: 370 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (17): age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, D...
## lgl  (1): high_use
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

For this assignment, I will choose sex, age, mother’s education, and father’s education as the independent variables. The model will study the association of these variables with the high/low alcohol use variable. My hypothesis is that male sex, higher age, and lower parental education are associated with higher alcohol consumption.

Let’s explore these variables with numerical and graphical presentations.

# Load needed packages
suppressMessages(library(ggplot2))
suppressMessages(library(GGally))
suppressMessages(library(tidyverse))

# Create an analysis dataset with only the chosen variables and explore its dimensions
alc_analysis <- alc %>% select(age, sex, Medu, Fedu, high_use)
str(alc_analysis)
## tibble [370 × 5] (S3: tbl_df/tbl/data.frame)
##  $ age     : num [1:370] 18 17 15 15 16 16 16 17 15 15 ...
##  $ sex     : chr [1:370] "F" "F" "F" "F" ...
##  $ Medu    : num [1:370] 4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu    : num [1:370] 4 1 1 2 3 3 2 4 2 4 ...
##  $ high_use: logi [1:370] FALSE FALSE TRUE FALSE FALSE FALSE ...
# Convert Medu and Fedu into factors
alc_analysis$Medu <- as.factor(alc_analysis$Medu)
alc_analysis$Fedu <- as.factor(alc_analysis$Fedu)
str(alc_analysis)
## tibble [370 × 5] (S3: tbl_df/tbl/data.frame)
##  $ age     : num [1:370] 18 17 15 15 16 16 16 17 15 15 ...
##  $ sex     : chr [1:370] "F" "F" "F" "F" ...
##  $ Medu    : Factor w/ 5 levels "0","1","2","3",..: 5 2 2 5 4 5 3 5 4 4 ...
##  $ Fedu    : Factor w/ 5 levels "0","1","2","3",..: 5 2 2 3 4 4 3 5 3 5 ...
##  $ high_use: logi [1:370] FALSE FALSE TRUE FALSE FALSE FALSE ...
# Draw a bar plot of each variable
gather(alc_analysis) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables; they will be
## dropped

Based on the bar plots, the age distribution of the students is, unsurprisingly, concentrated at 15-18 years, with a few older students recorded as well. There are slightly more females than males, and the majority of the students are not high alcohol users. For both mothers and fathers, education level 0 is very uncommon.

Let’s draw some more plots to explore the relationships of the variables.

# Initialize a plot of high_use and age
g1 <- ggplot(alc_analysis, aes(high_use, age, col = sex))

# Define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("age")

# Plot mother's education against high_use
ggplot(alc_analysis, aes(high_use, Medu, col = Medu)) + geom_point(position = "jitter") + facet_wrap("sex") +
  ggtitle("Mother's education status in high and low alc groups faceted by sex")

# Plot father's education against high_use
ggplot(alc_analysis, aes(high_use, Fedu, col = Fedu)) + geom_point(position = "jitter") + facet_wrap("sex") +
  ggtitle("Father's education status in high and low alc groups faceted by sex")

# Initialize a plot of high_use and sex
ggplot(alc_analysis, aes(high_use, sex, col = sex)) + geom_point(position = "jitter") + 
  ggtitle("High and low alcohol use in males and females")

Based on the first boxplot, the high alcohol consumption group among males looks older on average than the low use group, whereas no such clear pattern is visible among females. Among females, mother’s education level 2 seems to be associated with fewer high alcohol users, whereas among males it is not. The same level 2 pattern appears for females in terms of father’s education as well, but not for males. Based on the last plot, high alcohol use looks more common in males. Thus the hypotheses about age and sex seem promising, but the initial education hypothesis does not.

Let’s also look at some numerical summaries.

# Produce summary statistics by group
alc_analysis %>% group_by(sex, high_use) %>% summarise(count = n(), mean_age = mean(age))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 4
## # Groups:   sex [2]
##   sex   high_use count mean_age
##   <chr> <lgl>    <int>    <dbl>
## 1 F     FALSE      154     16.6
## 2 F     TRUE        41     16.5
## 3 M     FALSE      105     16.3
## 4 M     TRUE        70     16.9

Looking at the table, it is indeed more common for males to belong to the high alcohol use group. Examined this way, the mean ages of the groups are not very different.
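The same conclusion can be stated as proportions computed directly from the counts in the table above:

```r
# Share of high users within each sex, from the group counts above
c(F = 41 / (154 + 41), M = 70 / (105 + 70))  # about 0.21 vs 0.40
```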

Next, let’s fit a logistic regression model on the data using the chosen variables. I want to compare other education levels to the highest education level, so let’s first relevel the factors to have education level 4 as the reference.

# Relevel the education factors so that level "4" (the fifth factor level) is the reference
alc_analysis <- within(alc_analysis, Medu <- relevel(Medu, ref = 5))
alc_analysis <- within(alc_analysis, Fedu <- relevel(Fedu, ref = 5))

# Fit the model
m <- glm(high_use ~ Medu + Fedu + sex + age, data = alc_analysis, family = "binomial")
m2 <- glm(high_use ~ Medu + sex + age, data = alc_analysis, family = "binomial")
m3 <- glm(high_use ~ Fedu + sex + age, data = alc_analysis, family = "binomial")
m4 <- glm(high_use ~ sex + age, data = alc_analysis, family = "binomial")

# Print a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ Medu + Fedu + sex + age, family = "binomial", 
##     data = alc_analysis)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.08975    1.69359  -2.415   0.0157 *  
## Medu0         1.70131    1.28951   1.319   0.1871    
## Medu1         0.44254    0.47222   0.937   0.3487    
## Medu2        -0.49006    0.39917  -1.228   0.2196    
## Medu3         0.53865    0.32453   1.660   0.0970 .  
## Fedu0       -14.28418  614.68104  -0.023   0.9815    
## Fedu1        -0.08473    0.45649  -0.186   0.8527    
## Fedu2        -0.13582    0.38138  -0.356   0.7218    
## Fedu3        -0.34310    0.34689  -0.989   0.3226    
## sexM          0.98711    0.24232   4.074 4.63e-05 ***
## age           0.16674    0.10212   1.633   0.1025    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 416.96  on 359  degrees of freedom
## AIC: 438.96
## 
## Number of Fisher Scoring iterations: 13
summary(m2)
## 
## Call:
## glm(formula = high_use ~ Medu + sex + age, family = "binomial", 
##     data = alc_analysis)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.2774     1.6761  -2.552   0.0107 *  
## Medu0         1.7052     1.2610   1.352   0.1763    
## Medu1         0.4069     0.3712   1.096   0.2730    
## Medu2        -0.5040     0.3337  -1.510   0.1310    
## Medu3         0.4869     0.3003   1.622   0.1049    
## sexM          0.9611     0.2411   3.987 6.69e-05 ***
## age           0.1709     0.1007   1.698   0.0895 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 420.11  on 363  degrees of freedom
## AIC: 434.11
## 
## Number of Fisher Scoring iterations: 4
summary(m3)
## 
## Call:
## glm(formula = high_use ~ Fedu + sex + age, family = "binomial", 
##     data = alc_analysis)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.48890    1.66806  -2.691  0.00712 ** 
## Fedu0       -14.09651  623.22338  -0.023  0.98195    
## Fedu1        -0.02766    0.35030  -0.079  0.93706    
## Fedu2        -0.19495    0.31899  -0.611  0.54111    
## Fedu3        -0.26825    0.32673  -0.821  0.41165    
## sexM          0.95384    0.23641   4.035 5.47e-05 ***
## age           0.19766    0.10010   1.975  0.04831 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 429.07  on 363  degrees of freedom
## AIC: 443.07
## 
## Number of Fisher Scoring iterations: 13
summary(m4)
## 
## Call:
## glm(formula = high_use ~ sex + age, family = "binomial", data = alc_analysis)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.72614    1.65095  -2.863   0.0042 ** 
## sexM         0.93259    0.23552   3.960 7.51e-05 ***
## age          0.20425    0.09812   2.082   0.0374 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 431.73  on 367  degrees of freedom
## AIC: 437.73
## 
## Number of Fisher Scoring iterations: 4

Based on the model iterations, mother’s and father’s education levels do not seem to be significant predictors (no education coefficient reaches significance in any of the models, even though m2 has the lowest AIC), so let’s continue with sex and age.

Let’s look at the resulting odds ratios and confidence intervals based on the model coefficients.

# Compute odds ratios (OR)
OR <- coef(m4) %>% exp

# Compute confidence intervals (CI)
CI <- confint(m4) %>% exp()
## Waiting for profiling to be done...
# Print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                      OR        2.5 %    97.5 %
## (Intercept) 0.008860572 0.0003296194 0.2173513
## sexM        2.541071530 1.6088411176 4.0567456
## age         1.226598777 1.0134340303 1.4905305

Male sex and age were both significant predictors in the model, and their confidence intervals do not include the value 1; an odds ratio confidence interval containing 1 would indicate no significant association.
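As a reminder of where the odds ratios come from, exponentiating a logistic regression coefficient gives the OR directly; for example, for the sexM estimate reported earlier:

```r
# The OR for sexM is the exponentiated model coefficient
exp(0.93259)  # about 2.54: the odds of high use are roughly 2.5 times higher for males
```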

In the next part, let’s prepare a cross tabulation using the model predictions.

suppressMessages(library(dplyr))

# Predict() the probability of high_use
probabilities <- predict(m4, type = "response")

# Add the predicted probabilities to alc_analysis
alc_analysis <- mutate(alc_analysis, probability = probabilities)

# Use the probabilities to make a prediction of high_use
alc_analysis <- mutate(alc_analysis, prediction = probability > 0.5)

# Tabulate the target variable versus the predictions
table(high_use = alc_analysis$high_use, prediction = alc_analysis$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   254    5
##    TRUE    108    3

Let’s compute the total proportion of inaccurately classified individuals.

# Define a loss function (mean prediction error); the logical class vector
# is coerced to 0/1 when the predicted probabilities are subtracted
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# Call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc_analysis$high_use, prob = alc_analysis$probability)
## [1] 0.3054054

The mean prediction error is about 0.31. Looking at the cross-tabulation, the model is quite poor at predicting: it labels only 8 students as high users, so nearly all of the 111 true high users (108 of them) are classified as non-high users. Based on this assessment, the chosen variables are not adequate for predicting high alcohol use.
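For context, a trivial classifier that predicts FALSE for everyone would misclassify exactly the 111 true high users, giving practically the same error:

```r
# Error rate of always predicting "not a high user", from the table counts above
(108 + 3) / 370  # 0.3, essentially the same as the model's training error
```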

Let’s try the 10-fold cross-validation on the model.

# Define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc_analysis, cost = loss_func, glmfit = m4, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.3054054

The 10-fold cross-validation error is essentially the same as the training error, so the poor performance is not due to overfitting; the chosen predictors simply carry little information about high alcohol use.


Assignment 4 - Clustering and classification

The data used in this assignment is the “Boston” dataset from the MASS package. It contains variables related to housing values in the suburbs of Boston, such as median home value and per-capita crime rate. The dataset has 506 observations and 14 variables, all of which are numeric or integer.

Let’s start by loading in the data.

# Load the required package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# Activate Boston dataset
data("Boston")

# Explore the dimensions
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Next, let’s take a look at the variables and their distributions.

# Load needed packages
suppressMessages(library(ggplot2))
suppressMessages(library(GGally))
suppressMessages(library(tidyverse))

# Create a plot matrix to investigate the relationships of the variables
p <- ggpairs(Boston, lower = list(combo = wrap("facethist", bins = 20)))
p

The distributions of the variables are quite different, as are the ranges of the values.

Let’s also look at the numerical summaries of the variables.

# Take the summaries of the Boston variables
library(MASS)
data("Boston")
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Again it is visible that some variables have very small values, whereas others are measured in the hundreds.

Next, let’s standardize the data.

# Scale data
boston_scaled <- as.data.frame(scale(Boston))

# Summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

Now the variables have similar ranges and it’s easier to compare them.
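As a reminder, scale() centers each column by its mean and divides by its standard deviation; a minimal sketch:

```r
# scale(x) is equivalent to (x - mean(x)) / sd(x), column by column
x <- c(2, 4, 6, 8)
as.vector(scale(x))    # -1.16 -0.39  0.39  1.16
(x - mean(x)) / sd(x)  # identical values
```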

According to the instructions, let’s create a new crime variable and drop the old one.

# Summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# Create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# Create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE)

# Remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# Add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
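The quantile-based breaks used above always yield four roughly equal-sized classes; a minimal toy illustration:

```r
# cut() with quantile breaks puts about a quarter of the values in each bin
v <- 1:8
table(cut(v, breaks = quantile(v), include.lowest = TRUE))  # two values per bin
```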

Next, we will divide the scaled data into training and test datasets.

# Number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# Randomly choose 80% of the rows (a set.seed() call here would make the split reproducible)
ind <- sample(nrow(boston_scaled), size = n * 0.8)

# Create train set
train <- boston_scaled[ind,]

# Create test set 
test <- boston_scaled[-ind,]

# Save the correct classes from test data
correct_classes <- test$crime

# Remove the crime variable from test data
test <- dplyr::select(test, -crime)
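The same 80/20 split pattern can be sketched on a toy data frame (the names d and ind here are illustrative); calling set.seed() first makes the sampling reproducible:

```r
set.seed(42)                                  # makes the random split reproducible
d <- data.frame(x = 1:10)
ind <- sample(nrow(d), size = nrow(d) * 0.8)  # indices of the training rows
nrow(d[ind, , drop = FALSE])                  # 8 rows for training
nrow(d[-ind, , drop = FALSE])                 # 2 rows for testing
```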

Next, let’s fit the linear discriminant analysis and draw the LDA biplot using the training dataset.

# Linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# Print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
## [-0.419,-0.411]  (-0.411,-0.39] (-0.39,0.00739]  (0.00739,9.92] 
##       0.2500000       0.2574257       0.2326733       0.2599010 
## 
## Group means:
##                         zn      indus         chas        nox          rm
## [-0.419,-0.411]  0.8941678 -0.9117024 -0.155385496 -0.8747105  0.43456940
## (-0.411,-0.39]  -0.1335038 -0.2732605  0.030524797 -0.5490194 -0.13210193
## (-0.39,0.00739] -0.3859772  0.1921352  0.230279473  0.3857392 -0.01198607
## (0.00739,9.92]  -0.4872402  1.0170492 -0.009855719  1.0399323 -0.47134808
##                        age        dis        rad        tax     ptratio
## [-0.419,-0.411] -0.8574165  0.8541959 -0.6987343 -0.7494282 -0.41209686
## (-0.411,-0.39]  -0.3387177  0.3224554 -0.5534047 -0.5040926 -0.02387471
## (-0.39,0.00739]  0.4501370 -0.4102680 -0.4491780 -0.3391697 -0.30819978
## (0.00739,9.92]   0.8136793 -0.8460830  1.6388211  1.5145512  0.78158339
##                       black       lstat         medv
## [-0.419,-0.411]  0.37679484 -0.75623084  0.519412917
## (-0.411,-0.39]   0.31801232 -0.16043265  0.005842289
## (-0.39,0.00739]  0.05712155  0.06047145  0.072428217
## (0.00739,9.92]  -0.86477021  0.92735094 -0.708965051
## 
## Coefficients of linear discriminants:
##                  LD1          LD2          LD3
## zn       0.111658917  0.670743965 -0.960019919
## indus    0.105849105 -0.258745835  0.336561497
## chas    -0.086226021 -0.115134051  0.176523464
## nox      0.353184429 -0.684533448 -1.273649269
## rm      -0.128384975 -0.040661683 -0.132939285
## age      0.240133425 -0.325993448 -0.264682913
## dis     -0.001439766 -0.181323577  0.177757809
## rad      3.770494286  0.920596294  0.001290764
## tax     -0.045946618 -0.003549196  0.418497148
## ptratio  0.104102199  0.080012948 -0.126099097
## black   -0.098016054  0.043350707  0.135200583
## lstat    0.257847247 -0.191482968  0.339502817
## medv     0.196695004 -0.300663502 -0.170901425
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9635 0.0276 0.0089
# The function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# Target classes as numeric
classes <- as.numeric(train$crime)

# Plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

Next, let’s try predicting the crime rates in the test set using the LDA fit. The crime variable was already removed from the test set earlier when creating the test and training datasets.

# Predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# Cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##                  predicted
## correct           [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
##   [-0.419,-0.411]              17              9               0              0
##   (-0.411,-0.39]                5             13               4              0
##   (-0.39,0.00739]               0             11              18              3
##   (0.00739,9.92]                0              0               0             22

The prediction works best in the highest crime rate category, where all 22 observations are classified correctly (only three observations from the neighbouring class are wrongly assigned to it). Otherwise, the model is neither very bad nor very good at predicting the crime rate.
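The overall accuracy can be read off the cross tabulation by summing the diagonal and dividing by the test set size (102 observations):

```r
# Correctly classified observations divided by the number of test observations
(17 + 13 + 18 + 22) / 102  # about 0.69
```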

Next let’s reload the Boston dataset, standardize it, and perform k-means clustering.

# Activate Boston dataset
data("Boston")

# Standardize the data
boston_scaled <- scale(Boston)

# Euclidean distance matrix
dist_eu <- dist(boston_scaled)

# Look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# Set seed for reproducibility
set.seed(7)

# k-means clustering
km <- kmeans(boston_scaled, centers = 4)

# Plot the boston_scaled dataset with clusters
pairs(boston_scaled, col = km$cluster)

# Determine the number of clusters
k_max <- 10

# Calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# Visualize the results
ggplot(data.frame(k = 1:k_max, twcss = twcss), aes(k, twcss)) +
  geom_line()

Based on the first clustering attempt, visualized in the plot matrix, 4 clusters is not ideal for this data. The total within-cluster sum of squares supports this: it drops sharply up to k = 2 and only gradually after that, suggesting 2 as the optimal number of clusters for this dataset.

Let’s cluster again with k = 2.

# k-means clustering
km <- kmeans(boston_scaled, centers = 2)

# Plot the boston_scaled dataset with clusters
pairs(boston_scaled, col = km$cluster)

Two clusters seem to work better than four for this dataset.
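Beyond the pairs plot, a k-means result can be judged by its cluster sizes and centers. A minimal sketch on the built-in mtcars data (an illustrative stand-in for the Boston data):

```r
# Sketch: inspect a k-means result numerically instead of only graphically
set.seed(7)
scaled <- scale(mtcars)
km2 <- kmeans(scaled, centers = 2)

# How many observations fall into each cluster?
sizes <- table(km2$cluster)

# Cluster centers on the standardized scale: values far from 0 show
# which variables separate the clusters
centers <- km2$centers
```

Because the data is standardized, a center value near +1 or -1 on some variable means that cluster sits about one standard deviation away from the overall mean on it.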


Dimensionality reduction

Let’s begin by loading in the data and exploring the dimensions.

# Load packages
library(tidyverse)

# Load in data
human <- read.csv("data/human.csv")
str(human)
## 'data.frame':    155 obs. of  9 variables:
##  $ country  : chr  "Norway" "Australia" "Switzerland" "Denmark" ...
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
# Set "country" as row names
human_ <- column_to_rownames(human, "country")

Next, let's explore the variables and their relationships.

# Summaries of the variables
summary(human_)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
# Plot matrix
library(GGally)
ggpairs(human_, mapping = aes(alpha=0.2), 
              title="Human data variables",
              lower = list(combo = wrap("facethist", bins = 20)))

All variables in the dataset are now numerical, as the country information has been moved to row names. The distributions and ranges of the variables differ considerably. Some fairly strong negative correlations stand out, for example between maternal mortality and life expectancy at birth, and between adolescent birth rate and life expectancy at birth.
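The pairwise correlations eyeballed from the plot matrix can also be listed numerically with `cor()`. A minimal sketch on the built-in mtcars data (a stand-in for the human data):

```r
# Sketch: a numeric correlation matrix instead of a graphical one
cm <- cor(mtcars)

# Round for readability; strong negative pairs show up as values near -1
round(cm, 2)

# e.g. car weight vs. fuel efficiency is strongly negative
cm["wt", "mpg"]
```

This is often a useful companion to `ggpairs()`: the plot shows the shape of each relationship, while the matrix gives the exact coefficients.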

Next let’s do PCA on the raw human data.

# Perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_)

# Draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

All observations are squeezed into the upper right corner and only one principal-component arrow (GNI) is visible. According to this plot, practically all of the variation in the data is explained by GNI.

Next let’s standardize the data and perform the PCA again.

# Standardize the variables
human_std <- scale(human_)

# Perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)

# Draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2)

# Summary of the PCA
summary(pca_human)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000

Now the data is spread out instead of collapsing into one corner, and more arrows are visible, so the explained variance is divided across several variables. Overall, it is clear that the data needs scaling to be properly analyzed with PCA.

I think the results differ because PCA maximizes variance: without scaling, variables measured on large absolute scales dominate, since the variance of small-valued variables is negligible next to them. GNI has huge numeric values compared to the other variables, so it becomes overwhelmingly influential in the first, unscaled analysis.
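The scaling effect is easy to demonstrate on any dataset with mixed scales. A minimal sketch on the built-in mtcars data, where `disp` plays the role GNI plays above (the variable with by far the largest raw variance):

```r
# Sketch: how scaling changes PCA on data with very different variable scales
raw_var <- apply(mtcars, 2, var)        # disp dwarfs everything else

p_raw <- prcomp(mtcars)                 # no scaling
p_std <- prcomp(mtcars, scale. = TRUE)  # standardized variables

# Without scaling, PC1 is essentially the largest-variance variable
dominant <- names(which.max(abs(p_raw$rotation[, 1])))

# Proportion of variance per component, computed from the sdev values
prop_raw <- p_raw$sdev^2 / sum(p_raw$sdev^2)
prop_std <- p_std$sdev^2 / sum(p_std$sdev^2)
```

Unscaled, the first component soaks up nearly all the variance and points along the big-valued variable; standardized, the variance proportions spread across components, just as in the human data above.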

Based on the table, PC1 explains about 53% of the variation and is driven by maternal mortality, adolescent birth rate, GNI, life expectancy at birth, and expected years of schooling. PC2 explains about 16% of the variation and is driven by the percentage of females in parliament and the ratio of females to males in the labour force.

Let's move on to the tea dataset.

# Load in the data
library(FactoMineR)
library(tidyverse)
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

# Column names to keep in the dataset
keepers <- c("Tea", "How", "where", "spirituality", "escape.exoticism")

# Select the 'keepers' columns to create a new dataset
tea_time <- dplyr::select(tea, all_of(keepers))

# look at the summaries and structure of the data
summary(tea_time)
##         Tea         How                       where               spirituality
##  black    : 74   alone:195   chain store         :192   Not.spirituality:206  
##  Earl Grey:193   lemon: 33   chain store+tea shop: 78   spirituality    : 94  
##  green    : 33   milk : 63   tea shop            : 30                         
##                  other:  9                                                    
##              escape.exoticism
##  escape-exoticism    :142    
##  Not.escape-exoticism:158    
##                              
## 
str(tea_time)
## 'data.frame':    300 obs. of  5 variables:
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
# visualize the dataset
library(ggplot2)
pivot_longer(tea_time, cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free") + 
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

From the chosen variables we can see that most respondents do not connect tea with escape-exoticism or spirituality, and drink it without adding anything. The most common tea is specifically Earl Grey, bought from a chain store.
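The dominant categories read off the bar plots can also be expressed as shares with `prop.table()`. A minimal sketch, rebuilding the `How` variable from the counts in the summary above (195 alone, 33 lemon, 63 milk, 9 other):

```r
# Sketch: category shares for a factor, using the counts from summary()
how <- factor(rep(c("alone", "lemon", "milk", "other"),
                  times = c(195, 33, 63, 9)))

shares <- prop.table(table(how))
round(shares, 2)
```

So about 65% of the 300 respondents drink their tea without adding anything, which matches the visual impression from the bar plot.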

Next we will do MCA on the chosen variables of the tea dataset.

# Multiple correspondence analysis
library(FactoMineR)
mca <- MCA(tea_time, graph = FALSE)

# Summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.254   0.244   0.200   0.193   0.180   0.159
## % of var.             15.476  14.136  13.549  11.118  10.717  10.021   8.838
## Cumulative % of var.  15.476  29.612  43.161  54.279  64.995  75.016  83.854
##                        Dim.8   Dim.9
## Variance               0.149   0.142
## % of var.              8.260   7.887
## Cumulative % of var.  92.113 100.000
## 
## Individuals (the 10 first)
##                         Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                    | -0.467  0.261  0.198 |  0.557  0.407  0.282 | -0.354
## 2                    |  0.065  0.005  0.002 |  0.586  0.450  0.192 | -0.602
## 3                    | -0.362  0.157  0.218 | -0.133  0.023  0.029 | -0.260
## 4                    |  0.274  0.090  0.076 | -0.743  0.724  0.557 | -0.250
## 5                    |  0.274  0.090  0.076 | -0.743  0.724  0.557 | -0.250
## 6                    | -0.362  0.157  0.218 | -0.133  0.023  0.029 | -0.260
## 7                    | -0.362  0.157  0.218 | -0.133  0.023  0.029 | -0.260
## 8                    | -0.265  0.084  0.040 |  0.816  0.873  0.382 | -0.430
## 9                    |  0.359  0.154  0.075 |  0.340  0.152  0.068 |  0.026
## 10                   |  0.052  0.003  0.002 |  0.771  0.779  0.381 |  0.008
##                         ctr   cos2  
## 1                     0.171  0.114 |
## 2                     0.495  0.202 |
## 3                     0.093  0.113 |
## 4                     0.085  0.063 |
## 5                     0.085  0.063 |
## 6                     0.093  0.113 |
## 7                     0.093  0.113 |
## 8                     0.252  0.106 |
## 9                     0.001  0.000 |
## 10                    0.000  0.000 |
## 
## Categories (the 10 first)
##                          Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black                |   0.001   0.000   0.000   0.014 |   1.280  31.755
## Earl Grey            |   0.278   3.568   0.139   6.454 |  -0.461  10.728
## green                |  -1.629  20.948   0.328  -9.901 |  -0.176   0.268
## alone                |  -0.295   4.073   0.162  -6.962 |  -0.263   3.535
## lemon                |   0.789   4.917   0.077   4.797 |  -0.063   0.034
## milk                 |   0.238   0.851   0.015   2.119 |   0.391   2.519
## other                |   1.845   7.328   0.105   5.609 |   3.195  24.077
## chain store          |  -0.272   3.404   0.132  -6.275 |  -0.187   1.760
## chain store+tea shop |   1.096  22.426   0.422  11.234 |   0.353   2.541
## tea shop             |  -1.108   8.811   0.136  -6.385 |   0.280   0.618
##                         cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                  0.536  12.663 |  -0.261   1.381   0.022  -2.585 |
## Earl Grey              0.383 -10.697 |  -0.031   0.049   0.002  -0.711 |
## green                  0.004  -1.069 |   0.765   5.276   0.072   4.649 |
## alone                  0.129  -6.199 |  -0.194   1.998   0.070  -4.562 |
## lemon                  0.000  -0.382 |   1.927  33.505   0.459  11.716 |
## milk                   0.041   3.483 |  -0.381   2.496   0.039  -3.394 |
## other                  0.316   9.717 |  -0.207   0.105   0.001  -0.629 |
## chain store            0.062  -4.312 |  -0.479  12.019   0.407 -11.033 |
## chain store+tea shop   0.044   3.614 |   0.416   3.686   0.061   4.262 |
## tea shop               0.009   1.616 |   1.982  32.200   0.436  11.421 |
## 
## Categorical variables (eta2)
##                        Dim.1 Dim.2 Dim.3  
## Tea                  | 0.341 0.544 0.082 |
## How                  | 0.239 0.384 0.465 |
## where                | 0.482 0.063 0.584 |
## spirituality         | 0.141 0.198 0.044 |
## escape.exoticism     | 0.189 0.084 0.045 |
# Visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")

Based on the MCA factor map, spirituality, Earl Grey, escape-exoticism, and lemon group together. Tea shop, no escape-exoticism, and green tea also roughly group together. Chain-store tea buyers seem to drink their tea without adding anything. Drinking tea the “other” way (not plain, but also without lemon or milk) is not associated with any of the other categories studied.